Using Semantic Analysis to Classify Search Engine Spam

Authors

  • Andrew Westbrook
  • Russell Greene
Abstract

Search engines have tried many techniques to filter out spam pages before they can appear on the query results page. In Section 2 we present a collection of current methods used to combat spam. In Section 3 we introduce a new approach to spam detection that uses semantic analysis of textual content. This approach combines a series of content analyzers with a decision tree classifier to determine whether a given webpage is spam. Section 4 discusses the implementation of our approach. Our architecture augments the search engine Lucene by adding a Java-based spam classifier. The spam classifier uses the WordNet word database and the machine learning library Weka to classify web documents as either spam or not spam. We describe the results of our work in Section 5 and present our conclusions and future work in Section 6.
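
The pipeline named in the abstract (content analyzers feeding a decision tree) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two lexical analyzers (`top_token_share`, `unique_ratio`), the thresholds, and the hand-written tree are hypothetical stand-ins, not the authors' WordNet-based analyzers or a tree learned by Weka, and the sketch is in Python rather than the Java used in the paper.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Union


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_token_share(tokens: List[str]) -> float:
    """Hypothetical analyzer: share of the text taken by its most frequent token."""
    if not tokens:
        return 0.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)


def unique_ratio(tokens: List[str]) -> float:
    """Hypothetical analyzer: distinct tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def analyze(text: str) -> Dict[str, float]:
    """Run every analyzer and collect the results as a feature vector."""
    tokens = tokenize(text)
    return {
        "top_token_share": top_token_share(tokens),
        "unique_ratio": unique_ratio(tokens),
    }


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    feature: str
    threshold: float
    low: Union["Node", Leaf]   # branch taken when value <= threshold
    high: Union["Node", Leaf]  # branch taken when value > threshold


def classify(tree: Union[Node, Leaf], features: Dict[str, float]) -> str:
    """Walk the tree from the root until a leaf is reached."""
    while isinstance(tree, Node):
        value = features[tree.feature]
        tree = tree.low if value <= tree.threshold else tree.high
    return tree.label


# Hand-written tree with made-up thresholds; a real system would learn it from labeled pages.
TREE = Node(
    feature="top_token_share",
    threshold=0.2,
    low=Node("unique_ratio", 0.3, low=Leaf("spam"), high=Leaf("not spam")),
    high=Leaf("spam"),
)

if __name__ == "__main__":
    page = "cheap pills cheap pills cheap pills buy cheap pills now cheap pills"
    print(classify(TREE, analyze(page)))
```

Keeping the analyzers as independent functions that each emit one numeric feature means new analyzers can be added without touching the classifier, which only sees the feature vector.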

Similar resources

Low cost page quality factors to detect web spam

Web spam is a big challenge for the quality of search engine results. It is very important for search engines to detect web spam accurately. In this paper we present 32 low-cost quality factors to classify spam and ham pages on a real-time basis. These features can be divided into three categories: (i) URL features, (ii) Content features, and (iii) Link features. We developed a classifier using Resi...

A Novel Approach for Combating Spamdexing in Web using UCINET and SVM Light Tool

Search Engine spam is a web page or a portion of a web page which has been created with the intention of increasing its ranking in search engines. Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve. Anyone who uses a search engine frequently has most likely encountered a high ranking page that consists of nothing more than a bu...

Detecting Stealth Web Pages That Use Click-Through Cloaking

Search spam is an attack on search engines’ ranking algorithms to promote spam links into top search rankings that they do not deserve. Cloaking is a well-known search spam technique in which spammers serve one page to search-engine crawlers to optimize ranking, but serve a different page to browser users to maximize potential profit. In this experience report, we investigate a different and rela...

Spam Filtering using Contextual Network Graphs

This document describes a machine-learning solution to the spam-filtering problem. Spam filtering is treated as a text-classification problem in a very high-dimensional space. Two new text-classification algorithms, Latent Semantic Indexing (LSI) and Contextual Network Graphs (CNG), are compared to existing Bayesian techniques by monitoring their ability to process and correctly classify a series of...

Query expansion based on relevance feedback and latent semantic analysis

Web search engines are among the most popular tools on the Internet and are widely used by expert and novice users alike. Constructing an adequate query that best represents the user's information need is an important concern of web users. Query expansion is a way to reduce this concern and increase user satisfaction. In this paper, a new method of query expa...

Publication date: 2002